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Two algorithms capable of computing a transonic 3— D inviscid flow field 
about rotating machines are considered for parallel implementation. During 
the study of these algorithms, a significant new method of measuring the per- 
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new method creates an empirical definition of scalable parallel algorithms 
that is used to produce quantifiable evidence that a scalable parallel applica- 
tion has been developed. The implementation of the parallel application and 
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CHAPTER I 
INTRODUCTION 


Using multiple computational engines concurrently in order to increase 
the performance of computing platforms is a strategy that has been in exis- 
tence almost since the advent of computer systems. Computer architecture 
has felt the impact of this strategy in the form of pipelined and vector architec- 
tures, interleaved memory systems, floating point accelerators, input and out- 
put processors, and so on. Many of these “parallel” computers derived their 
design principles from observations of existing applications. When a large 
class of applications exhibit characteristics that could provide significant per- 
formance enhancement by concurrent utilization of computational resources, 
computer architectures were devised to take advantage of these characteris- 
tics. Applications that did not exhibit these characteristics adapted to the new 
computer architectures by modification of the implementation of algorithms 
and by the development of new algorithms. Automation of some of the pro- 
cesses involved in adapting an application to new computer architectures 
helped to speed the acceptance of these platforms by a wider audience of ap- 
plication developers. 

The performance gains that pipelined and vector architectures can pro- 
vide has begun to stabilize as the these technologies mature. In addition, per- 
formance gains resulting from new technologies that create faster electronic 
devices soon will be reaching physical barriers relating to device sizes and the 
maximum theoretical speed that signals can propagate. 
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General parallel computer architectures such as multicomputer en- 
sembles have been suggested as an approach to gain higher performance com- 
putational platforms in light of the future limitations traditional computer ar- 
chitectures will encounter. Since general parallel computer architectures are 
designed based on the assumption that traditional architectures will soon 
reach theoretical limits rather than exploiting characteristics of existing ap- 
plications, it is not surprising to discover that general parallel computer archi- 
tectures place a greater burden on application developers than traditional 
computer architectures. Key to the success of general parallel computer archi- 
tectures is the development of applications that can use the resources these 
computer architectures provide. 

This thesis advances the science of general parallel computing in two 
important ways, 1) it introduces a new method for measuring and classifying 
parallel algorithms, and 2) it contributes to the base of applications that can 
utilize the resources general parallel architectures provide. 

The method for measuring parallel algorithms that was developed in 
this thesis arose from the study of two parallel algorithms that are presented 
in Chapter II. One of these algorithms was developed to utilize vector archi- 
tectures and the other was originally developed for purposes other than paral- 
lel processing. In this thesis, the algorithm that was developed for vector ar- 
chitectures is referred to as the parallel approximate factorization algorithm 
and the other as the modified algorithm. Intuition and experiences suggested 
that the modified algorithm would be superior in the context of massively par- 
allel architectures but developing a solid argument for this case proved diffi- 
cult. Attempts to use common metrics of parallel algorithm performance to 
make this case provided ambiguous results. The experience gained from this 
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analysis led to a new perspective on parallel algorithm evaluation. This new 
perspective, based on the cost effectiveness of parallel computations, resulted 
in a new method for measuring parallel algorithms that provided unambigu- 
ous results. These findings are presented in Chapter III. 

Chapter IV presents a summary of the details involved in the imple- 
mentation of the modified algorithm developed in Chapter II and analyzed in 
Chapter III. 

Chapter V presents the results of numerical experiments needed to 
complete the arguments presented in Chapter III, and finally Chapter VI dis- 
cusses the implications of this research and suggests areas where future in- 
vestigation is needed. 



CHAPTER II 

AN OVERVIEW OF THE NUMERICAL ALGORITHM 


The first step in the process of developing an application for parallel 
processors is the identification of the components of an application’s structure 
which impede or aid in accomplishing the goal of solving the problem on a 
number of processors concurrently. For the specific task of parallelizing an 
existing application, analysis of the application falls roughly into two catego- 
ries: 1) the analysis of the general structural constraints presented by the 
computational model; and 2) the analysis of the specific constraints presented 
by the implementation of that computational model. 

Before analysis of the computational model can begin, the computation- 
al model must be presented. This chapter will present aspects the computa- 
tional model developed by Janus[l] that are relevant to the parallel imple- 
mentation. The reader interested in the complete derivation of the 
computational model is referred to Janus’s dissertation!!]. 

The Conservative Differential Form of the Euler Equations 
The application of interest involves seeking a numerical solution to the 
Euler equations, a subset of the Navier— Stokes equations given the simplify- 
ing assumptions of a thermally nonconducting inviscid perfect gas with no 
body forces. The Euler equations can be written in conservative form where 
no dependent variables are present outside of the differential operators. In 
the conservative form, errors produced by discretization operators will not 
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cause a violation of the conservation properties expressed in conservation law ' 
equations. The conservative form of the Euler equations written for the Carte- 
sian coordinate system can be expressed as 


dq , df(q) , dg(q) , dKq) _ n 
dr dx ^ dy + dz ~ a 


( 2 . 1 ) 


In equation (2.1), q represents the five dependent variables: mass, three 
components of momentum, and energy, whereas fl[q)> g(q), and h(q) represent 
the flux vectors in each of the coordinate directions x, y, and z respectively. It 
should be pointed out that the actual equations solved in the present applica- 
tion are the Euler equations cast in a curvilinear coordinate system. Although 
this increases the complexity of the equations somewhat, it does not change 
the data dependencies that exist between computational cells. Since the cur- 
vilinear derivation would only serve to obscure the issues of consequence to 
the parallel algorithm evaluation, the derivation of the numerical model will 
be presented in the Cartesian coordinate system. 


Discretization 

Applying a first-order implicit temporal discretization with a finite vol- 
ume spatial discretization to equation (2.1) yields 


43 * 

At Ax 




Ay 


+ 


Az 


= 0 , 


( 2 . 2 ) 


where 


Aq" = q n+i — q n t 


= mu i,, W = 




d 2 h(q) n+1 = h(qy + '.-h(qy+' 


■w+f 
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The choice of implicit time integration in the derivation of equation 
(2.2) is the most significant with respect to parallelization of this algorithm 
because this choice introduces multiple unknowns (represented by functions 
evaluated at the n + 1 time level) into the discretized equation for a computa- 
tional cell. Given the discretized equations for every cell, there are N equa- 
tions and N unknowns that when linearized can be written as a linear system 
of the form Ax = b. A is a matrix that contains the coefficients of the un- 
knowns, x is a vector of the unknowns referred to as the solution vector, and b 
is a vector containing a collection of known values. As an emphasis to the con- 
nection between the discretized equations and the linear system that results, 
the unknowns are grouped to the left-hand-side (LHS) of the equation and 
the knowns are grouped to the right-hand-side (RHS). 

Solving the linear system that results from implicit time integration is 
difficult, even on sequential processors. In general, much effort is expended to 
adjust the form of the LHS so that a kernel of efficient linear system solvers 
such as tri— diagonal solvers or LU factorization solvers can be employed. Un- 
fortunately, these schemes usually rely on backward substitution steps that 
create data dependencies forcing sequential or near sequential execution. 
Given these problems, the form of the LHS and the methods chosen to solve 
the linear system that results will be a subject of the present investigation. 

Higher order temporal discretizations of equation (2.2) can be derived 
using a parameterized form developed by Beam and Warming [2]. The higher 
order schemes, which require the storage of Aq and residual terms from the 
previous iteration, primarily affect the RHS and as such have little impact on 
the form of the linear system. In fact, because a first order flux Jacobian ex- 



trapolation is used on the LHS even in schemes of higher order of accuracy, 
order of accuracy will not significantly change the results of the parallel algo- 
rithm analysis of the first order derivations given here. 

Linearization 

Equation (2.2) can’t be used to construct a linear system of equations 
because the flux vectors /fa), g(q), and h(q) are non-linear functions of q. One 
approach to overcoming this problem is to approximate the flux vectors with 
functions that are linear combinations of the unknown variable < 7 n + 1 . This 
process, known as flux vector linearization, is accomplished by truncating the 
Taylor series for the function about time level n at the second term. This pro- 
cess yields the linearized flux vectors 

m n+1 - m n + fa " +1 - q n ), 

g(q) n + l = g(q) n + ( q n+l ~ q n ), (2.3) 

h(q) n+i = Kq) n + fa n+1 - q "). 

The derivative terms in equation (2.3) are called the flux Jacobians. 
The flux Jacobians are 6 by 6 matrices represented by the naming convention 



Substituting the linearized flux vectors of (2.3) into the discretized equation of 
(2.2) yields the equation 

Aq n , <5 Mq)" + A(4q n )) , <5 /$(<?)" + B(Aq n )) dMq)" + C(Aq n )) „ JX 
At + A~x + Ty + Tz = 0 . (2.4) 
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Collecting coefficients of the unknown terms to the LHS gives the equation 

(' + + d -^f + d ~T)- ™ 

Although equation (2.5) represents the general formulation of the dis- 
cretized equations used in the Euler 6olver, it is not complete. Before continu- 
ing, observe the dots following the flux Jacobians on the LHS. The dots imply 
that the finite volume difference operator is applied after the dis tribution of 
Aq. The key ambiguity that remains in equation (2.5) arises from the fact that 
the finite volume difference operator is extracting information from cell faces, 
while the solution vector is in terms of the dependent variables defined to be 
located at cell centers. This implies that one needs to choose a method to ex- 
trapolate information from cell centers (where it is assumed known) to cell 
faces (where it is assumed unknown). An obvious extrapolation method would 
be an averaging process of neighboring Aq values. Unfortunately, it is known 
that this approach leads to schemes that tend to blur (or smear) shocks. On 
the other hand, an upwind extrapolation scheme derived from the mathemati- 
cal character of the physics (which dictates that information is propagated in 
preferred directions) yields a solver that resolves shocks more sharply. How- 
ever, before this approach can be taken the flux vectors must be reformulated 
in order that spatial difference operators can be defined to allow information 
to propagate in these preferred directions. 

Flux Vector Splitting 

The Euler equation’s flux functions are homogeneous of degree one. 
Steger and Warming[3] exploited this observation to redefine the flux function 
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in terms of the flux Jacobian. For example, the x-direction flux, f(q), can be 
written as 


f(q) = Aq. 

Where A is the flux Jacobian matrix computed as before, 

dq 


( 2 . 6 ) 

(2.7) 


As shown in numerous references (see references [4] and [6]) the flux 
function can be split based on the signs of the eigenvalues of the flux Jacobian 
matrix A. Using the Steger— Warming[3] approach, flux Jacobians can be split 
and applied to equation (2.6) arriving at the equation 


/(<?) = A^q + A<->q = f+{q) + f~(q), (2.8) 

where the eigenvalues of A (+) are non— negative and those of /4 (_) are non— posi- 
tive. 


Equation (2.8) can then be linearized in the same way as equation (2.3) 
and substituted in the discretized of equation (2.2) to arrive at 


(l + f x ^- + f x 6A-- + f y W*- + f y W- + f 2 6^- + f 2 l,C-^ Sq ’’ = 


-JUIW | , 6 yS + (^ n , ^g~(4)" d z h+(q) n , S z h~(qr \ 

\ Ax Ax Ay Ay Az Az J ' 

Note the notational difference in the flux Jacobian terms of (2.8) and (2.9) in- 
dicated by 


X (+) = TA^T - 1 * A + = 

dq ’ 


( 2 . 10 ) 


where the matrices T, A^ + \ and T 1 are presented elsewhere[4]. 

Remember the purpose of splitting the flux functions was to accommo- 
date an upwinding extrapolation scheme based honoring the proper direction 
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of information propagation. The reason why Steger— Warming flux splitting 
helps accomplish this can be seen by drawing an analogy between the quasi— 
linear and characteristic variable formulations of the Euler equations. The 
eigenvalues of the characteristic equation can be seen as wave speeds where 
the signs of the eigenvalues represent the direction of wave propagation. It 
can be shown by analogy to the quasi— linear form of the Euler equations that 
the eigenvalues of the conservative flux Jacobians are identical to the eigenva- 
lues of the characteristic equation, thus the signs of the eigenvalues represent 
the direction of information propagation. Given this analysis, we can 
construct a first order extrapolation by which the flux vector q information is 
“moved” from cell centers to cell faces in the characteristic directions dictated 
by the signs of the corresponding eigenvalues. For example, using this extrap- 
olation technique, a first-order finite volume discretization operator on the 
LHS can be written as 

* A i) - ( A * A 1) W - ( A . ( 2 . 11 ) 

6M -Aq) = 04 -A - 04 . ( 2 . 12 ) 

A scheme that uses Steger— Warming flux splitting theory has now been 
derived. For completeness, it should be pointed out that the Euler solver de- 
veloped by Janustl] utilizes a more advanced splitting scheme on the RHS 
known as flux difference splitting. The specific scheme used is the approxi- 
mate Riemann solver originally developed by Roe[6] and implemented by Ja- 
nustl]. Unfortunately, linearizing the fluxes that result from the Roe scheme 
has proved difficult primarily because the Jacobians associated with Roe 
fluxes are nearly impossible to determine analytically. As a compromise to 
computing the Roe flux Jacobians numerically, the present application uses 
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Steger— ' Warming flux vector splitting theory on the LHS, and Roe flux differ- 
ence splitting theory on the RHS (The rational for this compromise is pres- 
ented in reference [7]). 


Approxim ate Factorization 

Examination of the LHS of equation (2.9) yields the observation that 
there is a linear combination of seven unknowns: the Aq associated with a giv- 
en cell plus those associated with its six neighboring cells. Taking advantage 
of the topologically orthonormal grid structure implied by the six point stencil 
and assuming explicit treatment of the boundary conditions, an ordering of 
the equations for the cells can be found such that the resulting/! matrix (of the 
system Ax=b) defined by the LHS will be block 6eptidiagonal, and thus is ex- 
pensive to solve. A less computationally expensive method is derived by ap- 
plying an approximate factorization to the LHS of equation (2.9) yielding the 
equation 

(' + ' + %*** ' + •)(' Ji 6 ' C ~ A = 

_ A a/ + (4) n + 6J-(q) n + V + (g) n + V~(7) n + W?)" + M~(g)" \ (2 13) 

l Ax Ax Ay Ay Az Az J ' 

Because of the structure of each factor in the LHS of equation (2.13), this fac- 
torization can be viewed as an approximate LU decomposition of the matrix 
defined by the LHS of equation (2.9). 

A two step procedure to solve this system can be implemented as follows 

(' + fM* - + fy 6 ’*' '■ + K - • 


(2.14) 
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( / + 2^M ' + ‘ + ^ c ^Aq n = Aq*, (2.15) 

where Aq represents an intermediate product of the factorization. 

The approximate factorization scheme introduces error terms in the 
LHS of the numerical derivation. In the steady state case it is assumed that 
these errors will be driven out as time-steps advance because the Aq terms 
are driven to zero. Similar arguments are made for the unsteady case when 
Newton iterations are applied. It has also been shown by way of numerical 
experimentation that this two step scheme appears to be stable for large CFL 
values[8][9]. 


Newton Iterations 

The discretized form of the Euler equations that appears in equation 
(2.13) can be used to find a steady state solution to the Euler equations by ad- 
vancing the equations in time until the Aq n solution values approach zero, 
thus giving a steady state solution. At this point, the terms that appear on the 
LHS have no effect on the solution (or the error). However, this is not the case 
when unsteady solutions are of interest. One approach to constructing a LHS 
implicit operator that does not effect the RHS in a way that is compatible with 
the derivations thus far can be illustrated by noting that the linearization pro- 
cedure indicated m equations (2.3) is strikingly similar to that of a Newton 
method iteration. 

The vector form of the Newton method can be written as (refer to [10]) 
L'(q p ~ l )(qP - qP~ l ) = (2.16) 
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For the present case, the function L(q) given by equation (2.2) is used for 
the RHS of equation (2.16), or 


U.q n+l ) = 


-,n 4- 1 


- q n + A t 


( 


Ax 


n+ 1 


dyg(q) n &Mq) 


n+l 


Ay 


Az 


(2.17) 


Using equation (2.17) in equation (2.16) and applying an approximate factor- 
ization yields the discretized equation given by 


(' + * ■ + • + fix* fi,c- .y q - 


+ 1 V 7 — - 


q" - q n+1 - p - zl/(y? n+1 'P) , 


where 


Aq n+x 'P = ^n + ij>+i _ qti + \4> , and 

fin + lj - ^/ + (g) n + 1 ' f , <?/~(^) n + 1 ^ dyg + (q) n+l -P 

Ax Ax + Ay + 

dyg-{qr^-P d z h + (q) n + i -P <5 z h~(q)^^P 

Ay + ~A Z + -Jl • (2.18) 

Equation (2.18) describes a two step solution procedure much like equa- 
tions (2.14) and (2.15) that is implemented as follows 

( 7 + A^ 6xA+ ' + Aj 6yB + ' + Az 6zC+ = q n - q n+i * - At(R n+l ^ ,(2.19) 

+ + + =Aq\ (2.20) 

Equations (2.19) and (2.20) represent the final form of the Euler solver 
used herem. Before the Newton iterations can begin, an initial condition must 
be established. If the last time-step values (time level n) are chosen as initial 
conditions for the first Newton iteration (time level n+l), then the first step of 

the Newton iteration is identical to equation (2.13) and thus assumed to repre- 
sent a reasonable initial condition. 
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— e Form of the Resulting Linear system 
As suggested earlier, the form of the linear system that is implied by the 
implicit operators on the LHS plays a significant role in the analysis of the 
parallel algorithm. The form of the linear system is dictated by the implicit 
equation for each computational cell and by the ordering of variables in the 
solution vector. If the variables of equation (2.18) are ordered in the 

solution vector as Fortran orders array entries in physical memory, the result- 
ing linear system is captured by a set of banded matrices as shown in 
Figure 2.1. In this figure, the dark central diagonal line represents the place- 
ment of the sum of the identity matrix and flux Jacobian matrices of equation 

(2.18) and the gray lines represent the placement of the remaining flux Jaco- 
bian matrices. 



The two step method that results from the approximate factorization 
method given in equations (2.19) and (2.20) describe a process that involves a 
forward and backward substitution. The forward substitution that occurs 
during the solution of equation (2.19) is illustrated in Figure 2.2. In this ini- 
tial step of the approximate factorization solution, the solution vector is com- 
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puted from top to bottom by algebraic substitution. The second step of the 
approximate factorization solution involves a backward substitution that pro- 
ceeds from the bottom to the top of the solution vector. These substitution 
steps, as they have been described, are sequential processes and as such domi- 
nate execution time when this algorithm is implemented on parallel computer 
systems. 


T — 

— — 


— — i 

\ 




\ \\ 

V 

= 

RHS 

\\ \ 




\\\ 





Figure 2.2 A Form of the Linear System Implied by Equation (2.19) 

The Para llel Approximate Factorization Algorithm 
When implementing the approximate factorization algorithm described 
by equations (2.19) and (2.20) the substitution procedures become the critical 
obstacle to obtaining high parallel performance. The reason the substitution 
procedures are essentially sequential operations can be found by examining 
the form of the linear system in Figure 2.2. Note that one band of the banded 
matrix is placed in close proximity (actually it is immediately adjacent) to the 
main diagonal. The adjacency of the first band forces a strict dependency (a 
recursive relationship in vector processing terms) in the description of each 
substitution step. It is possible, however, to move the bands away from the 
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main diagonal by ordering the equations in the linear system based on diago- 
nal planes as presented in [1] and [11]. The form of the linear system when 
diagonal plane ordering is chosen for the solution vector of equation (2.19) is 
illustrated in Figure 2.3. The dark gray boomerang shaped region indicates 
the region of placement for the flux Jacobians and the light gray stair-step 
line indicates how the substitution procedure can be divided into a sequence of 
parallel operations. In this stair-6tep, each step represents a step in the sub- 
stitution process where all of the equations covered by the rise of the next step 
can be solved simultaneously. 



Figure 2.3 The Form of the Linear System Implied by Equation (2.19) 
Given a Diagonal Plane Based Ordering of the Solution Vector 


Observations regarding the impact of diagonal plane ordering on the 
substitution procedure led to the development of an algorithm for the Cray 
vector architecture that demonstrated significant performance gains over the 
previous sequential substitution procedures[ll]. Since the diagonal plane or- 
dering allows many equation substitutions to occur simultaneously, this algo- 
rithm may also provide performance gains on general parallel architectures. 
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The Modified Algorithm 

For reasons that will be described in Chapter III, the parallel approxi- 
mate factorization algorithm is not appropriate for massively parallel comput- 
er systems. In light of this fact, an alternative approach to obtaining greater 
parallel performance in the substitution passes of the approximate factoriza- 
tion algorithm must be considered. One possible approach is to partition the 
solution vector and “decouple” the partitions so that substitutions can occur in 
each partition simultaneously. The partitions of the solution vector are de- 
coupled by zeroing out entries in the linear system such that each partition 
can begin substitution processes independently. The decoupling process is il- 
lustrated when two partitions are considered for equation (2.14) in Figure 2.4. 
In this figure, the entries that fall in the shaded region are replaced with zeros 
m order to decouple the lower partition from the upper partition. A s imilar 
procedure would be applied to the backward substitution step described by 
equation (2.16). Although this decoupling introduces errors in the solution 
process, it is assumed that the Newton iterations will be able to reduce these 
errors in much the same way as these iterations reduce the error introduced 
by approximate factorization. In addition, Belk[12] demonstrated that conver- 
gence can be obtained when this method is used and that the location of 
shocks wiU be correct even when they cross partition boundaries. 

This approach, identified as the modified algorithm, can provide addi- 
tional parallelism, but at the cost of additional Newton iterations. For the 
three and four partition cases that Belk studied, the number of time-steps re- 
quired to achieve a steady state solution was increased by 25%[12], A 25% 
penalty seems a reasonable price to pay considering that a four partition case 
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potentially could extract a four fold increase in performance on a parallel pro- 
cessor. A more comprehensive study of the cost effectiveness of this method 
will be a subject of investigation in this thesis. 



Figure 2.4 An Illustration of Decoupling Solution Vector Partitions 


Implementation of the Modified Algorithm 
It would seem that locating and zeroing the entries in the linear system 
for the modified algorithm might be a complex process. In the actual imple- 
mentation the entries in the linear system are not actually zeroed, instead the 
coefficients (represented by the Aq values in the solution vector) of the flux 
Jacobians are zeroed as necessary when the substitution passes proceed. This 
process is accomplished by way of domain decomposition. In the domain de- 
composition approach, the grid is subdivided into a number of domains (or 
blocks) and these domains define the partition of the solution vector. When 
the forward or backward substitution occurs in each domain, zero Aq values 
are injected into the substitution procedure at domain boundaries resulting in 
an algorithm that is simple to implement. 



CHAPTER III 

ANALYSIS OF THE PARALLEL ALGORITHM 


The objective of a parallel algorithm analysis is to make a value judge- 
ment based on the appropriateness of an algorithm for implementation on par- 
allel computer systems. An essential component of this value judgement, and 
resulting categorization, is performance measurement captured by a set of de- 
fined performance metrics. The most common performance metric, the rate of 
operations performed per second, is valuable for assessing an algorithm’s cost 
effectiveness for a given problem on a given hardware platform, but it pos- 
sesses virtually no predictive capability on how an algorithm will perform in 
any other instance than the one that was measured. For sequential machines, 
predictive performance metrics are derived from memory access patterns, ra- 
tios of integer to floating point operations, and other measurements that can 
be used to model how subunits of a sequential architecture will be ut iliz ed. 
For parallel architectures the definition of new predictive performance met- 
rics are evolving. This chapter will discuss several predictive performance 
metrics and adapt them to demonstrate that the parallel approximate factor- 
ization algorithm described in Chapter II is a poor algorithm for massively 
parallel implementation. The suitability of the modified algori thm will also be 
discussed. 
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Amdahl’s Law 

A m da hl s Law[13], a frequently referenced criterion for analysis of par- 
allel algorithms, is also the source of the popular parallel performance metric, 
speedup. Speedup is defined as the ratio of the time required to execute an 
algorithm on one processor to the time required to execute the algorithm on N 
processors. Amdahl argued that if s percent of an algorithm’s execution time 
was inherently sequential, and p percent of an algorithm’s execution time was 
completely parallelizable, then the algorithm could at most execute in l/s of 
the time required by a sequential processor. This argument is easily demon- 
strated given the time required to execute the sequential program will simply 
be the total s + p = 100%, and the time required to execute on N parallel pro- 
cessors will be j + p/N. Given this, the computation of the performance met- 
ric speedup is easily expressed as 


sequential execution time s + p \ 

speedup = — - — — : = 1— = k foil 

parallel execution time s + p/N s + p/N ' v ; 

The maximum speedup can be determined by taking the limit of equa- 
tion (3.1) as the number of processors approaches infinity given by 

maximum speedup = », { 1 

y N ~* co \s + p/N 

This argument illuminates the severe limitations that parallel proces- 
sing places on its applications since an algorithm that spends only ten percent 
of its work in sequential execution can at most gain a ten fold improvement in 
speed. 


-*• 


(3.2) 
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An important variation of the speedup metric is parallel efficiency de- 
fined by the average utilization of the parallel processing elements. This met- 
ric is defined by 

parallel efficiency = s P ee ^ u P _ ^ gj 

One potential drawback of the speedup and parallel efficiency metrics is 
that they tend to measure architectural constraints as well as algorithmic 
ones. Often an application with good speedups no longer achieves them when 
the application is optimized. This case occurs when the sequential fraction of 
the application is dominated by message exchange times that are not reduced 
when the parallel work is reduced by optimization. Because of this phenome- 
na, the fastest algorithm implementations often give the poorest speedup 
measurements!! 4][1 6] . F or this reason speedup and parallel efficiency can be 
valuable metrics for measuring the efficiency of an application on a given 

hardware platform but they are not useful for the purpose of categorizing par- 
allel algorithms. 


A Definitio n of Parallel Algorithms 

In an informal definition, an algorithm might be defined as a recipe or a 
description of a set of processes required to achieve a defined goal. In this in- 
formal setting, a parallel algorithm might be defined as a recipe that describes 
concurrent processes. Likewise, a formal definition of parallel algorithms may 
be described by first expanding the definition of a sequential algorithm de- 
fined by an ordered set of operations given by 

A = [o lt 0 2 - • • ,O n ) . (3.4) 

Now define a partially ordered algorithm as an ordered set of sets given by 
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* 


SP = {01,02. .0m) , 


(3.5) 


*1 = («i.o 2 ,-,oj,* ! =(o /H ,o ;t! , 


".oj, 


such that any algorithm constructed from an arbitrary ordering of the opera- 
tors contained in sets <p l through <p m produce an algorithm that is equivalent 
to jL. Two algorithms are equivalent if and only if applying all operations in 
each algorithm produces the same results in every case. A partially ordered 
algorithm is a parallel algorithm if each set <p i through <p m exhibits the prop- 
erty of operation independence. The set of operations <p i has the property of 
operation independence if and only if no two operations contained within <p t 
read or write to the same data space written to by the other. 

The character of the parallelism defined by a parallel algorithm can be 
captured in a parallelism profile defined for the parallel algorithm as the set 
of ordered pairs given by 


v-w = ■ ■ ■ .K, r J} ■ (3.6) 

The degree of parallelism, , is defined by the number of elements in the set 

<Pi . The operation execution time, 7^ , is defined as the maximum execution 
time required to complete any operation in the set <p i . 


Scaled Speed up and Fixed-Time SiVa-Up 
Traditionally, the set of applications considered as viable for computa- 
tional solution grows as the power of computational engines grow. Likewise, 
the size of the problems considered for computational solution typically grows. 
This property can be attributed to two factors: 1) problem sizes considered on 
earlier computational engines tend to be less interesting because most have 
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been solved, and 2) modeling of key physical phenomena becomes a limifiogr 
factor in many design processes giving rise to a demand for more complex and 
larger scale computational modeling capability. Gustafson[16] argues that 
Amdahl s Law should be reevaluated in light of larger parallel mndiinAq since 
the sequential fraction of an algorithm is typically a function of the size of the 
problem considered. Gustafson further argues that a scaled speedup metric 
(that is a measurement of speedup as both problem size and number of proces- 
sors are increased) is a more appropriate measurement for parallel applica- 
tions. A significant criticism of scaled speedup is that problem sizes should be 
driven by application requirements, not measurement requirements. (What 
value is there in knowing a problem of a large size can be solved efficiently 
when the need for the solution of a problem that size hasn’t been clearly dem- 
onstrated?) 

Further work of Sun and Gustafson[17] argues for the use of a fixed- 
time size-up metric as a measure of performance. In fixed— time size-up, the 
size of the problem that executes in a predetermined amount of time is re- 
corded as the number of processors increase. It is argued that the fixed-time 
s * ze— U P metric is “fairer” than speedup because results are much less affected 
by optimization of the parallel work. A further assertion is made that fixed- 
time size-up is machine independent which implies that it could be a good 
metric for measuring an algorithm’s performance independent of the hard- 
ware platform. 

Scalability o f the Parallel Approximate Factorization Algorithm 

Since significant speedups were achieved using a Cray vector processor 
for the implementation of the forward and backward substitution phases of 
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the approximate factorization algorithm[ll], it is natural to ask if the parallel- 
ism used to gain vector processing speedups could be used in a multicomputer 
setting. This question has been addressed by Welch[18] where he developed a 
fine grained decomposition of the problem for multicomputers. Through the 
use of a communications coprocessor that was capable of scheduling tasks effi- 
ciently, granularity could be adjusted by appropriately mapping computation- 
al cells to processors[19]. A variety of mappings of cells to processors was con- 
sidered. No speedups were achieved with this approach primarily due to 
small problem size and high message latencies. Whether larger problem sizes 
could exploit multicomputer parallelism remained an open question of this 
work. 

The question of using larger problem sizes to utilize multicomputer par- 
allelism was addressed in [20] where it was shown that the problem size re- 
quired to utilize an increasing number of processors grows superlinearly. This 
suggests that if a large multicomputer is considered for implementation of the 
parallel approximate factorization algorithm, a significantly large problem 
size would be required to effectively use these processors. 

In an attempt to further understand the nature of this algorithm, a 
measurement of the size-up metric will be made on an idealized parallel pro- 
cessor with the unique property of having zero message passing cost. The idea 
behind this measurement is that it will provide an upper bound on the ex- 
pected performance of this algorithm. The execution time required by an 
idealized parallel processor with zero message passing cost can be modeled 
given the parallelism profile of a given parallel algorithm. Given this profile, 
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the time required to compute a parallel algorithm 3>onan idealized multicom- 
puter consisting of N processors is given by 


execution time 



N 



(3.7) 


The parallelism profile for the parallel approximate factorization algo- 
rithm can be defined in three stages of operations: 1) construction of the linear 
system, 2) performing the forward substitutions, and 3) performing the back- 
ward substitutions. These operations are performed on a three dimensional 
computational space composed of computational cells organized into a three 
dimensional array of dimensions m, n, and p. The first operation (construct- 
ing the linear system) is comprised of computing the right hand side of the 
equations and computing the flux Jacobians. These operations can be per- 
formed in any order and thus become the first step in the parallel algorithm. 
Since 60 percent of the work of this algorithm is spent performing this opera- 
tion, and the work is distributed evenly between cells, the time required to 
execute this operation can be expressed as 0.6 normalized time uni ts per cell. 

Td describe the ordering required to correctly perform the forward sub- 
stitution consider that each cell in the computation space can be identified via 
three indices: ij, and k where l < i < m, l < j < n, and 1 < k < p . At first 
the cell identified by i + j + k =3 (given by 1+1+1 = 3) is the only computa- 
tional cell that can contribute to the forward sweep. Next, three cells defined 
hyi+j + k =4 (2+1+1 = 1+2+1 = 1+1+2 = 4) can be operated on concurrently. 
This process continues until only one cell, defined by i + j + k = m + n + p, 

can contribute to the computation. The backward substitution is identical ex- 
cept the order of these steps is reversed. 
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The description of the ordering required for the forward and backward 
substitution phases can be used to develop a parallel algori thm for the sub- 
stitution process. The parallelism profile of this parallel algorithm for a 10 by 
10 by 10 problem size is given in Figure 3.1. Since 40 percent of the work re- 
quired by the complete algorithm is required to perform the forward and back- 
ward sweeps, a normalized time of 0.2 per computational cell is required to 
perform a forward or backward sweep operation. 



Figure 3.1 Degree of Parallelism for a 10x10x10 Problem Size 


With the description of the parallel algorithm complete, the execution 
time on the idealized multicomputer can be realized using equation (3.7). The 
program listed in Appendix A was used to compute equation (3.7) and itera- 
tively find the fixed— time size— up curve. Before computation of the size— up 
curve can begin, a time must be chosen. For a first pass the time required to 
execute a 10 by 10 by 10 problem size on a single processor was considered. 
The size— up curve for this case is given in Figure 3.2. The problem size was 
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normalized to the 10 by 10 by 10 problem size and the dotted line represents 
an optimal size— up curve. 


Problem 

Size 



Number of Processors 

Figure 3.2 Size— Up Based on 10x10x10 Sequential Execution Time 


The results shown in Figure 3.2 look very pro mising , but remember 
that these results are derived from an idealized model multicomputer. In 
practice the amount of parallelism that could actually be exploited would be 
much less because cell level granularity would be too fine to efficiently exploit 
on any current multicomputer system. One way to increase the granularity of 
the parallel algorithm would be to cluster cells and solve each cluster as if it 
were a sequential operation. A clustering of cells into cubes of size 3 by 3 by 3 
would increase the granularity of the algorithm at the cost of reducing the 
available parallelism. Because the structure of the parallelism would essen- 
tially be the same, we can model the clustered algorithm with a smaller prob- 
lem size. The 10 by 10 by 10 clustered solution can approximately be modeled 
by choosing the time for the fixed-time size-up curve to be the execution time 
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on a single processor of a 3 by 3 by 3 problem. The results of this expe rimen t 
are shown in Figure 3.3. 



Number of Processors 


Figure 3.3 Size-Up Based on 3x3x3 Sequential Execution Time 


The results shown in Figure 3.3 are less promising, but what is more 
profoundly disturbing is the wide variance in results of the size-up measure- 
ment in the two previous cases. It seems that the size-up curve can look 
arbitrarily good depending on the choice of the fixed— time constraint. A 
possible explanation of these variances is that by starting with the larger 
fixed-time constraint, the unused parallelism of the sequential case is “bor- 
rowed” during the initial stages of size-up. If this is the case, a better size-up 
measurement could be taken by first “boiling off* the sequential case parallel- 
ism by performing a fixed problem size scale-up as suggested by Amdahl’s 
Law and then use the scaled execution time to do a fixed-time scale-up. The 
procedure for this approach would work as follows: 1) select an arbitrary 
problem size, 2) do a fixed-size scale-up until parallel efficiency is at 65 
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percent, and 3) use the execution time from fixed-size scale-up to do a fixed- 
tune size-up. The rationale for choosing a parallel efficiency of 66 percent will 

be given later. When this approach is used the curve shown in Figure 3.4 
results. 

Figure 3.4 is based on the application of the previous procedure for an 
initial problem size of 10 by 10 by 10. The same basic curve resulted from 
applying the same procedure to problem sizes ranging from 3 by 3 by 3 to 20 by 
20 by 20 suggesting that this method is a much more reliable measurement 
than simple size-up where the fixed-time constraint is left unspecified. 



Number of Processors 


Figure 3.4 Size-Up After 65% Parallel Efficiency Scale-Up 


Obviously the curve shown in Figure 3.4 is not promising if nearly 4000 
additional processors are required to solve a problem only six times larger in 
the same amount of time. It is still hard to feel comfortable with these results 
since the first size-up curve did show a near optimal size-up curve. It seems 
very large problems could be solved with this algorithm on massively parallel 
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processors. The confusion that results from examining the size-up metric can 
be linked to the fact that the size-up metric makes no attempt to measure cost 
effectiveness, much less optimize it. 

Sterling and Laprade[21] used a cost effectiveness metric in the analy- 
sis of a simple parallel model given by 


Cost Effectiveness = ^ er f ormar ^ e 

Resource Cost 


(3.8) 


In this simple measure of cost effectiveness, performance is defined by the 
amount of work performed per amount of time and resource cost denotes the 
cost of computational resources. For parallel processors, the resource cost can 
be based on node-hours defined as the product of execution time and the num- 
ber of processor used. Given this measure of resource cost, equation (3.8) can 
be written in more explicit terms as 


Work 

Cost Effectiveness = ■ (3.9) 

In equation (3.9), Work is defined as the number of operations required to 
solve the problem on a sequential architecture; and the variable t is defined as 
the time required to obtain a solution on the parallel processor. 

Sterling and Laprade[21] showed that for their simple parallel model, 
cost effectiveness was maximized when parallel efficiency was at 60 percent. 
In order to determine the point of optimal cost effectiveness for the parallel 
approximate factorization algorithm, the cost effectiveness and efficiency were 
measured during a fixed-size scale-up of a 10 by 10 by 10 problem. Figure 3.6 
represents the cost effectiveness versus efficiency curve resulting from this 
measurement demonstrating that optimal cost effectiveness occurs at 65% 
parallel efficiency. 
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Figure 3.5 Cost Effectiveness Curve for a 10x10x10 Problem Size 
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Figure 3.6 Optimal Cost Effectiveness Versus Problem Size 

Perhaps a more interesting measurement of a parallel algorithm than 
the size-up measurement is the measure of the optimal cost effectiveness as 
the problem size scales. Figure 3.6 illustrates this measurement for the paral- 
lel approximate factorization algorithm. This figure demonstrates that if it is 
not cost effective to solve a small problem using the parallel approximate fac- 
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torization algorithm, it will not be cost effective to solve larger problems. Con- 
sidering this and the results of Welch[18], the parallel approximate factoriza- 
tion algorithm is considered to be of limited value as an algorithm for 
massively parallel computer systems. 

Analysis of the Modified Algorithm 

The forward and backward substitution passes are the key sources of 
scalability problems in the parallel approximate factorization algorithm. 
Upon examination, the transfer of Aq values from one computational cell to 
another is the key data dependency driving the ordering of the substitution 
passes. A possible way to improve the parallel structure of the approximate 
factorization algorithm would be to develop approximations for Aq that are 
free of dependencies. Belk[12] demonstrated that a computational space can 
be partitioned into blocks, and that a correct solution can be obtained with 
^<7 = 0 values transferred at block boundaries. With this relaxed criterion for 
the exchange of Aq values, all blocks can execute concurrently within a solu- 
tion iteration. The analysis of this parallel algorithm follows two courses: 1) 
analysis of the parallel structure of the algorithm, and 2) analysis of the nu- 
merical costs introduced by relaxing Aq exchanges. 

The analysis of the parallel structure of the modified algorithm as- 
sumes that a computational problem space will be partitioned into blocks, and 
zero Aq boundary exchanges will be performed such that computation of each 
block can proceed independently within an iteration. Since the substitution 
phases that exist within blocks are implemented as sequential algorithms in 
the parallelized turbo-machinery application, the parallelism that results 
from the substitution phases within the blocks are not considered. 
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Given this model, a simulation of this algorithm’s execution on the 
idealized multicomputer can be developed. The efficiency curve for this algo- 
rithm varies widely as the number of processors are scaled mainly due to the 
stair-step speedup that occurs as the number parallel elements become divis- 
ible by the number of processors. This behavior makes it impossible to plot 
the cost effectiveness versus efficiency curve used earlier to determine the 
point of optimal cost effectiveness. However, the point of optimal cost effec- 
tiveness can easily be read from a plot of the cost effectiveness metric versus 
the number of processors in the idealized multicomputer. The plot of the cost 
effectiveness and efficiency is given in Figure 3.7 for a problem case consisting 
of 27 blocks. In this case the point of optimal cost effectiveness is found when 

the number of processors is equal to 27. The efficiency for this case is 100 per- 
cent. 



Number of Processors 


Figure 3.7 Cost Effectiveness and Efficiency as Processors Scale 

Figure 3.7 shows that for at least one case, optimal cost effectiveness 
occurs when the index of parallelism is equal to the number of processors. 
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When N p is the index of parallelism, W s is the sequential work required by a 
single domain, t s is the time required to execute a single domain, and N= N p , 
the cost effectiveness equation becomes processor and problem size indepen- 
dent. The equation for cost effectiveness for the case where N=N p is given by 


Cost Effectiveness — _ 


W s N p 


W s 


(3.10) 


Nt 2 N {t s N p /N ) 2 t 2 ' 

Equation (3.10) demonstrates that if the problem is scaled up such that do- 
mains of equal size are added at the same rate as processors are added, the 
cost effectiveness will remain constant, and thus the parallel structure of this 
algorithm is scalable. 


The analysis of the parallel structure of the modified algorithm did not 
consider the communication structure required to implement the algorithm. 
The communications structure that results from partitioning the domain is 
largely a mesh topology and is expected to map in a scalable way to architec- 
tures that implement mesh topology communication networks. 


Me asuring Numerical Costs in the Modified Alpnrithm 
The modified algorithm contributes error terms to the left hand side of 
numerical formulation given in Chapter II. The assumption is that the New- 
ton iterations used to converge to a final solution (at a given time-step) will 
also diminish the effect of error contributions from the modified algorithm. 
The performance of the modified algorithm can become degraded because 
additional Newton iterations may be required to reach convergence. Because 
the precise number of extra iterations that will be required to reach conver- 
gence can not be predicted, numerical experimentation is required to estimate 
these effects. 
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Because the cost effectiveness formulation is based on the work re- 
quired by the sequential algorithm and the time required for solution on a par- 
allel system, the cost effectiveness measurement can be used to evaluate the 
results of numerical experiments. Numerical experiments will proceed along 
similar lines as the previous analysis. First an optimal cost effectiveness 
should be determined by increasing the level of partitioning in a fixed-size 
problem. If the problem size can then be scaled up without reduction in maxi- 
mum cost effectiveness, this algorithm will be considered scalable. 

On the Cost Effectiveness Metric 

Traditionally, measurements of parallel algorithm implementations 
have been limited to execution times, floating point performance, and speed- 
up. The problem with these measurements is that they do not fit into any co- 
herent theory by which algorithms can be classified. The cost effectiveness 
analysis presented in this chapter begins to solve this problem by introducing 
a classification that can be reliably measured, that is scalability. The basis of 
this analysis is the determination of the optimal cost effectiveness curve for 
increasing problem sizes. If measurements are not taken at the optimal cost 
effectiveness they are useless since the potential of the algorithm was not 
measured. This is the key problem with the fixed-time size-up measurement 
when the fixed-time is not set at the point of optimal cost effectiveness. That 
is, a fixed-time size-up measurement could be made entirely in a cost ineffec- 
tive domain causing misleading results as was the case in Figure 3.2. 

The power of the optimal cost effectiveness curve’s ability to classify 
parallel algorithms can be demonstrated with the following example. First 
consider an optimal sequential algorithm that does not parallelize. Since this 
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is a sequential algorithm, the execution time will be proportional to the wort. 
Given this, the cost effectiveness as a function of problem size can be com- 
puted as 


Cost Effectiveness = ^ ^ - = — l 

Ntl (kw seq f ^ • 


(3.11) 


seq 


where k is the constant of proportionality that relates Work to execution time. 
Equation (3.11) demonstrates that the cost effectiveness of a sequential ap- 
plication declines as a reciprocal of problem size. 

Also considered in this example is a parallel algorithm that is not scal- 
able such as the parallel approximate factorization algorithm. This algorithm 
is able to achieve a high cost effectiveness for small problem sizes but is un- 


able to maintain this cost effectiveness as the problem size is increased. For 
the purposes of illustration, assume that this algorithm’s cost effectiveness de- 
clines linearly. 

Finally, consider a scalable parallel algorithm that maintfling a 


constant cost effectiveness as the problem size is increased provided that 
enough processors are available. Now a plot of the optimal cost effectiveness 
of each algorithm can be used to determine the optimal algorithm based on 
problem size. For small problem sizes, the algorithm that is not scalable 
achieves a higher cost effectiveness than any of the other algorithms. Because 
the scalable algorithm can maintain its cost effectiveness as the problem size 
is increased, it will eventually become the most cost effective as is demon- 
strated in Figure 3.8. 
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One essential criterion required by a scalable algorithm if it is to main , 
tain a constant cost effectiveness is the availability of processor resources. In 
practice, no parallel computing system has an infinite number of processors so 
at some point any scalable algorithm will reach a problem size beyond which 
no additional performance gains can be made. At this point the cost effective- 
ness should begin to decline as a reciprocal of the problem size for the same 
reasons that the sequential algorithm exhibited this behavior. Making this 
observation, it is possible to see how cost effectiveness can be used to deter- 
mine which algorithm will be optimal under conditions of limited processor re- 
sources. F or example, consider two scalable algorithms: algorithm A and algo- 
rithm B. Algorithm A is able to attain a higher cost effectiveness when 
processors are available because it can effectively utilize more processors on a 
given problem size than algorithm B. The implication of this relationship is 
that algorithm A will reach the limit of available processors at a smaller prob- 
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lem 6ize than algorithm B. When this occurs algorithm A’s cost effectiveness 
will begin to decline while the cost effectiveness of algorithm B will continue to 
remain constant which gives algorithm B the opportunity to become the most 
cost effective for larger problem sizes as illustrated in Figure 3.9. 



The previous example can be used to describe why the parallel approxi- 
mate factorization algorithm is appropriate for the Cray and not for massively 
parallel computer architectures. In the case of the Cray, a relatively small 
number of vector processing elements act concurrently with low data transfer 
costs. In this environment, the parallel approximate factorization algorithm 
gains the maximum advantage since low data transfer costs allow for a fine 
grained implementation. On the other hand, the modified algorithm will nev- 
er be able to obtain enough concurrent computing resources to become cost ef- 
fective since the Cray implements a relatively small vector size. In a massive- 
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ly parallel computer architecture the tables are turned since coarser grained 
computations are required to overcome communication costs and there is an 
abundance of concurrent computing resources. In this case, the parallel 
approximate factorization algorithm is never able to attain a cost effective 
solution because of granularity limits. In contrast, the modified algorithm 
takes advantage of coarse granularity and abundant parallelism to obtain cost 
effective solutions for large problem sizes. 

The examples presented here demonstrate the power of the cost effec- 
tiveness analysis in determining the desirability of different parallel algo- 
rithms. It seems that this metric is highly relevant to parallel algorithm anal- 
ysis and is a powerful tool to understanding the complex behaviors that 
parallel algorithms can exhibit. 



CHAPTER IV 
IMPLEMENTATION 


The process of parallelizing a computer program developed for a partic- 
ular application involves developing an evolving strategy for implementing 
the application on a parallel architecture. This process begins with an analy- 
sis of the algorithm used in the application. This analysis has resulted in the 
selection of the modified algorithm described in Chapter II as a potentially 
cost effective approach. The strategy for parallelization is further refined dur- 
ing study of the application’s sequential implementation because decisions 
made when an algorithm is implemented on a sequential architecture can im- 
plicitly assume sequential execution. The final step in the process involves 
making compromises between reducing the degree of modification required of 

the sequential application and obtaining high utilization of a given parallel 
architecture. 

The final result of the parallelization strategy is a set of incremental 
modifications applied to the sequential program such that the expression of 
parallel execution is accomplished by way of architectural and system soft- 
ware semantics. The turbomachinery application developed by Janusfl] was 
chosen for parallelization because it solved a class of problems that tax the 
computational capacity of traditional sequential architectures, and because its 

multi-block capability could be used to implement the modified algorithm de- 
scribed earlier. 
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Object-Oriented Fortran 

Object-Oriented Fortran[22] was selected as the programming lan- 
guage for the parallel implementation of the turbomachinery application. Ob- 
ject-Oriented Fortran provides an interface loosely based on object-oriented 
design principles with an emphasis on execution and synchronization of multi- 
ple object instances on parallel computing platforms. Object-Oriented For- 
tran is a portable parallel programming language that can allow programs to 
port without change to a wide range of parallel architectures including net- 
works of workstations. This feature alone allowed for the use of workstation 
style debuggers and allowed idle workstation cycles to be used during the de- 
velopment stages of the parallel application. In addition to these features, Ob- 
ject-Oriented F ortran provides a level of abstraction that makes dealing with 

mapping problems to processors and packaging messages less tedious than 
portable message passing libraries. 

Preparation of the Application 

The sequential turbomachinery application was modified in several 
stages in order to prepare it for the final parallel implementation. The first 
issue that was dealt with was the treatment of common block storage in the 
original application because common block storage was used as a mechanism 
for passing data between subroutines. This presented a problem in the paral- 
lel implementation on distributed memory architectures since the subtle coor- 
dination of a global data space as a means of communication between subrou- 
tines needed to be redefined in terms of the explicit message passing and 
synchronization semantics of the distributed memory model. In an effort to 
get a handle on where common block storage was used to co mm unicate be- 
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tween what would be separate distributed memory modules, the common 
block mechanism was incrementally replaced with subroutine arguments as a 
way of passing data between subroutines. Although this approach tended to 
make the argument lists for subroutines large, it also made the data depen- 
dencies between subroutine calls much more obvious. 

During the conversion from common block storage to subroutine argu- 
ments as a means of passing information between subroutines two points be- 
came apparent: 1) the sequential implementation of communication between 
blocks depended strongly on sequential execution for proper synchronization, 
and 2) “hardwired” parameters in the boundary condition and residual cal- 
culation subroutines would make testing a wide variety of domain decomposi- 
tion strategies difficult. In light of these two findings, the next step in prepar- 
ing the sequential application for parallel execution was to strip out all of the 
multi-block support and develop a general single block code that would have 
no “hardwired” parameters in its implementation of the boundary condition 
and residual calculation routines. This application could then be advanced to 
a parallel multi-block code by re-implementing the inter-block boundary 
management with distributed memory semantics. The development of the 
general single block code required many months of effort dominated by the 

process of removing common block storage as a means of passing data between 
subroutines. 


Development of t h e Parallel Application 
The development of the parallel application involved adding parallel 
constructs to the single block code that would enable multiple blocks to 
execute m parallel by way of the modified algorithm. A block is represented 
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by a three dimensional region of the simulation space constructed from a 
structured grid that is a rectangular prism in computational space. The faces 
of the rectangular prism for a given block are divided into rectangular surface 
patches. These patches were used in the single block implementation of the 
solver to impose boundary conditions such as impermeable surfaces, inflow, or 
outflow. In the parallel case these boundary conditions were extended to in- 
clude a connective boundary condition. The connective boundary condition is 
described by the address of another patch which becomes a repository of q 
variable information. The patch address consists of a unique block identifier 
and a patch identifier that is unique within a block. In addition to the transfer 
°f Q va l ues through the use of the connective boundary condition, rotation op- 
erations could be applied to the momentum vector component of the q vector 
when required to impose symmetry about the x axis. When the connective 
boundary condition was added to the single block code, the application was ca- 
pable of parallel solution of multi-block problems that did not involve relative 
motion between blocks. For turbomachinery problems an additional commu- 
nication semantic was required, namely the communications peculiar to the 
interface between domains that did involve relative motion between blocks 
represented by blocks rotating at different angular velocities. 

In the sequential application, transfer of information between bound- 
aries shared by blocks in relative motion was handled with a communication 
management structure called a ring buffer. The ring buffer was a global buff- 
er where information would be inserted from one block and subsequently read 
by another block. A moving cursor (pointer) to the ring buffer was used to 
handle the changing communication patterns caused by the rotating interface. 



44 


This approach was impractical to implement in the parallel application since 
it was based on the assumption that the two blocks accessing the inte rface 
would share a common ring buffer resource. In a parallel implementation, a 
global ring buffer would cause a bottle-neck in the communication since t his 
global buffer would need to be m anaged by a single processor. Ib circumvent 
this problem, blocks need to directly communicate to the appropriate block on 
the other side of the ring buffer interface instead of going through a global 
buffer “middle man”. The semantic capturing this direct communication 
across the ring buffer interface has been named a ring buffer connection. In 
the ring buffer connection, a patch is given the angular velocity of the blocks it 
is connecting to and a circular list of addresses to patches that will be commu- 
nicated with as the rotation of the interface progresses. With this information, 
a patch involved in a ring buffer communication can partition its information 
(as a function of time-step and angular velocities) and pass this information 
directly to the appropriate partners on the other side of the ring buffer inter- 
face. A schematic of this connection for a single patch is given in Figure 4.1 
In order to handle symmetric solutions, a rotation operator is included with 
each patch address. The ring buffer connection is managed by dividing the 
given patch into sub— patches that will match the patch divisions on the other 
side of the ring buffer connection and then sending these sub-patches to the 
appropriate blocks. When a block receives this partial patch it also receives 
an offset which it uses to store the information in the appropriate locations in 
the computational space. In addition to q variable information, the ring buffer 
connection also communicates information that is used to compute a new grid 
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for a region between the rotating interfaces. This communication uses the 
same basic procedures involved in the communication of the q variables. 



Figure 4.1 A Schematic of a Ring Buffer Connection for One Patch 


The implementation of the ring buffer connection required complex pro- 
cedures in order to calculate sub-patches from moving patch interfaces. Sub- 
routines handling these connections were implemented in C and tested sepa- 
rately from the application so that high confidence in the implementation of 
this communication procedure could be achieved. The C language was chosen 
for this task over Fortran because of its advanced capabilities regarding dy- 
namic memory management and complex data structures. The final parallel 
application was validated as correctly implementing the parallel communica- 
tion semantics by performing an exact comparison of the results generated by 
the parallel implementation to those of the sequential implementation. 

The result of the block oriented communication semantics implemented 
m the parallel turbomachinery application was an input file that is complex 
and filled with a large degree of redundant information. The complexity of 





this file made creating a corresponding input file for the parallel application a 
tedious process. This problem is addressed through the development of an au- 
tomated domain decomposer that generates both a partitioned grid and input 
file for the parallel application. 

The Automated Do main Decomposition Thnl 
The domain decomposition tool is responsible for partitioning the grid 
mto sub-grids that will be used for parallel computation. Since the work re- 
quired by a sub-domain is proportional to the number of cells in the domain, 
load balancing is achieved by making the number of cells in each sub-grid 
approximately equal. A key challenge of the domain decomposition tool was 
the development of an expression for boundary condition and connectivity in- 
formation described by the original grid in such a way that this information 
could easily propagate to the decomposed grid. This information is captured 
m a file format that is based on a block structured language such as the C lan- 
guage. The input parser for this input file format was developed using the 
standard lex and yacc progra mming tools. 

The input file defines its data by variable assignment and variable 
scoping. Variables in the input file can have four basic types: integer num- 
bers, real numbers, character strings, and existential. The first three are well 
known types and need little description, whereas the existential variable type 
is a variable that is defined by its existence in the input file. This variable can 
be considered as a boolean type and is used to turn on various features of the 
decomposition tool. 

The first variable scope of the input file is the global scope. This vari- 
able scope is used to define general parameters of the simulation and grid. 
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Table 4.1 contains a partial list of the general parameters used by the domain 
decomposition tool. 


Table 4.1 Variables Accessible from the Global Variable Scope 


Variable Name 

Type 

Variable Values 

temporal_accuracy 

STRING 

“first-order* 

“three-point-backward” 

spatial_accuracy 

INTEGER 

1-3 

limiter 

STRING 

“minmod” 
“superbee” 
“van— leer” 

number_of_cycles 

INTEGER 

1-MAXINT 

beta 

REAL 

k 

1 

to 

a 

flux j acobian_update_frequency 

INTEGER 


number_of_refinement_iterations 

INTEGER 

l-MAXINT ~ i 

number_of_cycles_between w _clicks 

INTEGER 

1-MAXINT 

grid_symmetry 

INTEGER 

1-MAXINT 

free_stream_mach_number 

REAL 

0.0 - 2.0 


The READ_GRID command is used to read in the initial grid file for the 
simulation. This command is similar in format to a function call in the C lan- 
guage and is given two arguments: a file name, and the number of grid blocks 
in the file. The grid blocks that are read from the grid file are numbered as 
they occur in the file starting with one. The grid block numbering is used to 

identify the grid block when the boundary conditions and connectivity in- 
formation is given. 

The next variable scope used in the input file is the variable scope of the 
grid block. This variable scope is created by the BLOCK command. The 
BLOCK command has a syntax similar to a structure declaration in the C lan- 
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guage in that it uses braces to delimit the context of the scope it creates. The 
arguments to the BLOCK command are the grid identifier created by the 
READ_GRID command and a name that will be assigned to the grid block for 
use in developing connections between grid blocks. The only variable current- 
ly used m the scope of the block command is the divide _bhxk variable. The 
divide_block variable is used to tell the decomposer how many equally sized 
partitions in which to divide a grid block for parallel solution. 

Within the scope of the BLOCK command several commands are avail- 
able for describing boundary conditions associated with the grid. The first of 
these commands is the PATCH command. The PATCH command is used to 
create rectangular regions on the exterior surface of the computational space 
(representmg surfaces of rj, or 5 = constant) represented by the geometry for 
the assignment of boundary and connectivity information. The PATCH com- 
mand divides the exterior surfaces of the computational space into three 
classes of patches denoted by IJ_FACE, JK_FACE, and KI_FACE. Each 
PATCH command creates a pair of patches: one for the lowest index of the grid 
and one for the highest index of the grid. For example: when the grid points in 
a grid block are numbered by three indices such that i=[l,ni], j=[l,nj], 
k-[l,nk], a patch command that defines a pair of patches on the IJ_FACE de- 
fines patches for the exterior surfaces at k=l and k=nk. The PATCH command 
assigns a name to the pair of patches it creates. The individual patches of the 
pair are denoted by following the name of the patch pair by a plus or a minus 

sign. The plus (minus) sign denotes that the patch on the high (low) index 
value of the grid will be used. 
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The TERMINAL command is also available within the scope of the 
BLOCK command. The TERMINAL command is used to assign lists of 
patches to terminal identifiers. Terminals represent end-points of commu- 
nication and will be used to express boundary conditions and connections. 
The argument of the TERMINAL command is the name of the terminal that it 
creates. The TERMINAL command is followed by a list of patches enclosed in 
braces. When two terminals are connected to one another, the patches fisted 
by the TERMINAL command will be connected in the order they appear. 

The final command that is available to the scope of the BLOCK com- 
mand is the TERMINAL_ATTRIBUTES command. This command is used to 
assign boundary conditions to terminals. The TERMINAL_ATTRIBUTES 
command creates a new variable scope for the terminal name that is given to 
the TERMINAL.ATTRIBUTES command. Two variables are recognised in 
the terminal variable scope by the domain decomposition application and they 
are listed in Table 4.2. 


Table 4.2 Variables within the Scope of TERM IN AL_ ATTRIBUTES 


Variable Name 

Type 

Variable Values 

boundary_type 

STRING 

“impermeable” 

“inflow” 

“inflow-outflow” 

“outflow” 

zero_area 

EXISTENTIAL 

None 


The final variable scope created for the domain decomposition applica- 
tion is the variable scope created by the ZONE command. A zone is a collec- 
tion of grid blocks which share the same angular velocity. Because all grid 
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blocks share the same angular velocity within a zone, the. connections between 
grid blocks within a zone are static. The only variable that the domain decom- 
position tool uses in the variable scope created by the ZONE command is the 
normalized angular velocity of the zone represented by the variable dtdt. In 
addition to the variable dtdt, there are several commands available within the 
scope of the ZONE command. 

The first command within the scope of the ZONE command is the 

GRAB_BLOCK command. This command is used to assign a grid block to a 
zone. 

The CONNECT command is used to create static connections between 
terminals defined in blocks assigned to the zone. The CONNECT command 
identifies two terminals that will communicate information during the simu- 
lation. A terminal is identified by the name of the grid block and the name of 
the terminal separated by a period. If the CONNECT command is followed by 
the keyword SYMMETRIC then momentum vectors communicated across this 

connection will be rotated in order to satisfy the geometric symmetry of the 
problem. 

The final command available within the scope of the ZONE command is 
the RING_BUFFER command. Ring buffers are used to communicate in- 
formation between zones since relative motion will be present in these connec- 
tions. The RING_BUFFER command defines the computational dimension 
along which grid line shearing will occur as the zones rotate, the depth into 
the zone in which regridding will occur, and a list of terminals that will partic- 
ipate in the ring buffer interaction. 
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Finally, CONNECT commands in the global context are used to connect 
zones by connecting ring buffers defined in each zone. The syntax of this 
CONNECT command is the same as the CONNECT command of the zone ex- 
cept that no symmetric attribute may be applied and the arguments to the 
connect command are ring buffer identifiers rather than terminal identifiers. 

An example input file for the unducted counter rotating prop fen prob- 
lem studied in Chapter V is given in Appendix B. 



CHAPTER V 
RESULTS 

The objective of this chapter is to present the results of numerical ex- 
periments aimed at answering two key questions: 1) is the parallelized algo- 
rithm a cost effective approach (compared to sequential solution) even though 
it requires extra computational effort brought on by additional Newton itera- 
tions, and 2) is the cost of additional Newton iterations controlled as larger 
problem sizes are considered. 

Computing the T^ e vel of ConvergenftA 
The value of the Aq vectors computed during the solution process is a 
measure of how well the solution satisfies the discretized equations, where the 
form of the discretized equations depends upon the formulation. For example, 
when Newton iterations are not used the value of Aq represents how well the 
spatial numerical derivatives of the discretized equations are satisfied. This 
provides a good measure of the level of convergence for a steady state flow field 
since the temporal derivatives are zero in a steady state solution. On the oth- 
er hand, the value of the Aq vectors computed during the Newton iteration 
process provide a measure of how well both the spatial and temporal numeri- 
cal derivatives are satisfied, thus providing a measure of convergence for un- 
steady flow fields. Either set of these Aq vectors provide a local measurement 
of convergence for the discretized equations and can also provide a global mea- 
surement if an averaging process is applied to them. The averaging process 
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applied to compute a global convergence level in this chapter is a root-mean- 
square average. This average favors a majority reduction in the magnitude of 
the Aq vectors. The value computed using this average is called the root- 
mean-square sum of the residual terms, or more simply the RMS residuals. 
For the parallel application, the RMS residuals are computed independently 
for the density, momentum, and energy terms of the Aq vector providing three 
independent measures of global convergence. 

The Solution Process 

In the sequential application, a typical simulation involves a two step 
process. The first step involves an impulsive start where the geometry of the 
simulation is thrust into a steady flow field of constant free stream Mach num- 
ber. The simulation progresses through a complete turbomachine revolution 
in order to arrive at a solution that is relatively free of the transient conditions 
created from an impulsive start. Newton iterations are not used during the 
first revolution since high resolution answers of the transients caused by the 
impulsive start are not considered interesting. This first revolution can be 
considered as a means of approximating an initial condition for the simula- 
tion. After the first revolution, two additional revolutions are simulated using 

Newton iterations. The number of Newton iterations used during this stage 

of the simulation is the number required to reduce the RMS residuals by two 

orders of magnitude. The parallel application will employ the same procedure 
to obtain a solution. 
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Solution Resul ts from the Parallel Application 
The numerical experiments presented in this section were performed on 
a GE UDF8— 8 counter-rotating unducted fan immersed in a Mach 0.7 flow 
field. Two cases are considered for this geometry: a zero angle of attack con- 
figuration and a configuration at ten degrees angle of attack. The zero degree 
angle of attack configuration can take advantage of symmetry to reduce the 
number of grid cells required by a factor of eight. These two configurations 
allow for the measurement of the scalability of the algorithm. As presented in 
chapter III, the zero angle of attack problem will be executed on larger num- 
bers of processors until cost effectiveness is maximized. Then the problem will 
be scaled up by a factor of eight when the ten degree angle of attack case is 
considered. The cost effectiveness of the scaled-up case will be measured and 
if it remains relatively constant the algorithm will have demonstrated an abil- 
ity to scale to larger configurations and remain cost effective thus demonstrat- 
ing scalability. 

An Intel iPSC/860 parallel computer comprised of 32 hypercube con- 
nected i860 processors with 8 megabytes of RAM on each processor was used 
to measure the parallel application for the zero angle of attack configuration. 
The number of processors applied to the zero angle of attack configuration was 
varied from 4 to 12 processors and the execution times required to complete 
three complete revolutions of the geometry was measured. At 6 processors 
enough memoiy was available on each node to allow flux Jacobian freezing to 
take place (flux Jacobian freezing means that the flux Jacobians are not re- 
computed after the second Newton iteration, but instead are stored and re- 
used). The savings in computations that flux Jacobian freezing produces is 



significant. Table 6.1 lists the execution times achieved as the number of pro- 
cessors applied to the problem increases. 

The cost effectiveness for each case considered in Table 6.1 is computed 
by applying the equation 

Cost Effectiveness = problem Size 

n UF ' <61) 

This expression of cost effectiveness is equivalent to equation (3.9) when Work 

is defined in terms of problem size. Observation of the cost effectiveness for 

the Intel iPSC/860 simulations shown in Table 6.1 demonstrates that optimal 

cost effectiveness is achieved when the problem is decomposed for 10 proces- 
sors. 

When the angle of attack configuration is considered, the problem size 
grows by a factor of eight. If the maximum cost effectiveness occurs at 10 pro- 
cessors, it is expected that a maximum cost effectiveness for the angle of at- 
tack case will require 80 processors. Since only 32 processors are available on 
the Intel iPSC/860 parallel computer used for this experiment, another paral- 
lel computer was used for the cost effectiveness measurement of the angle of 
attack configuration. The Intel Delta parallel computer at the California 
Institute of Technology was used for this simulation. The Intel Delta parallel 
computer is comprised of roughly 612 mesh connected i860 processors with 16 
megabytes of RAM per processor. The zero angle of attack configuration 
execution time on the Intel Delta for a 10 processor decomposition was mea- 
eured as a reference point. The Intel Delta provided a modest improvement in 
execution time (over that of the iPSC/860) that can be attributed to differences 
in compiler and communication network performances. 



56 


The 80 processor simulation of the angle of attack configuration re- 
quired a slightly longer execution time than the 10 processor simulation of the 
zero angle of attack case. This caused a slight drop in the cost effectiveness for 
the larger problem size. Since no additional Newton iterations were required 
by the angle of attack case, the increase in execution time can be attributed to 
communication costs. Since no attempts were made to optimize communica- 
tion costs by appropriately mapping the problem decomposition to processors, 
it is expected that the execution time of the larger problem size will be reduced 
when the placement of blocks on processors is optimized. 


Table 5.1 Execution Time Results of the Parallel Application 


Processors 

Newton 

Iterations 

Jacobian 

Execution 

Time 

Problem Size 

Cost 

Effectiveness 


6 


5.56 hrs 

23520 cells 

188.17 


6 


3.83 hrs 

23520 cells 

361.66 

b ipse 

6 

freezing 

2.96 hrs 

23520 cells 

447.41 

8 ipse 

7 

freezing 

2.54 hrs 

23520 cells 

455.70 

10 ipse 

7 

freezing 

2.17 hrs 

23520 cells 

499.48 

12 ipse 

8 

freezing 

2.06 hrs 

23520 cells 

461.87 

10 delta 

7 

freezing 

2.09 hrs 

23520 cells 

538.45 

80 delta 

7 

freezing 

2.18 hrs 

188160 cells 

494.91 


Co nvergence Re sults of the Parallel Annl 
Smce the modified algorithm introduces additional errors into the solu- 
tion algorithm, it is important to compare the effects these errors have on con- 
vergence to the fully coupled sequential algorithm. For these comparisons the 
RMS residuals that result from the sequential application using five Newton 
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iterations in the final two revolutions is considered. Figure 5.1 shows the 
RMS residuals for the density component of the Aq vector for the sequential 
algorithm and the parallel algorithm for both angle of attack cases during the 
first prop revolution. The 10 block case i8 at zero angle of attack and the 80 
block case is at 10 degrees angle of attack. The RMS residuals for the energy 
component are shown in Figure 6.2. It is apparent from these plots that the 
parallel algorithm was less capable of converging the energy equation than 
the density equation. Since the energy component of the Aq vector is larger on 
average than any other term, the conclusion can be drawn that the error 
created by forcing Aq to zero at the block boundary will inject the greatest er- 
ror in the energy equation. This problem may be lessened by normalizing the 

equations in a way that balances the magnitudes of the components of the Aq 
vector. 

The RMS residuals for the final Newton iteration during the last two 
prop revolutions are given in Figure 5.3 and Figure 5.4. These results show 
that the parallel algorithm performed better than the sequential at converg- 
ing the density equations, but was less capable of converging the energy equa- 
tion. The effects of differences in convergence rates of the different residual 
terms requires further study. 

In order to verify that the parallel application converges to the correct 
solution, the pressure coefficients along the base and tip of the blades in the 
prop fan were considered. For comparison, the results at the 480 th time-step 
of the 10 processor parallel solution were compared to the results of the se- 
quential application. The results of these comparisons are shown in Figures 
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5.6 through 6.8. These comparisons demonstrate that the parallel application 
results are essentially equivalent to the sequential results. 



LOG BASE 10 RMS RESIDUAL LOG BASE 10 RMS RESIDUAL 



time step 


Figure 5.1 Density Residuals During the First Revolution 



TIME STEP 


Figure 5.2 Energy Residuals During the First Revolution 
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Figure 6.3 Density Residuals at the Final Newton Iteration 



TIME STEP 


Figure 6.4 Energy Residuals at the Final Newton Iteration 












CHAPTER VI 
CONCLUSIONS 

Two parallel algorithms, the parallel approximate factorization algo- 
rithm and the modified algorithm, have been considered as candidates for 
massively parallel multicomputer systems. Arguments have been made that 
the parallel approximate factorization algorithm is a poor choice for massively 
parallel multicomputer systems based on cost effectiveness. Numerical ex- 
periments have demonstrated that the modified algorithm is a cost effective 

means of gaming high performance on massively parallel multicomputer sys- 
tems. 

Observations of the constraints placed on parallel algorithm develop- 
ment can also be made based on cost effectiveness analysis and experiences 
related to the development of the parallel turbomachinery application. Since 
optimal sequential algorithms are based on minimizing instructions and opti- 
mal parallel algorithms are based on a balance between minimising instruc- 
tions and maximizing parallel utilization it is reasonable to assert that the op- 
timal parallel algorithm is not identical to the optimal sequential algorithm. 
This observation was also made by Chan [ 23 ] where he argued that the set of 
efficient kernels presumed when scientific applications are developed must 
change when parallel implementations are considered. The cost effectiveness 
analysis also suggests that the optimal general algorithm will be polymorphic 
in that it will change form based on problem size and the number of processors 
available. This suggests that the structural complexity of optimal parallel al- 
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gorithms will be greater than optimal sequential algorithms. The structural 
complexity of the parallel algorithm may also be increased due to limitations 
that the parallel architecture places on parallel algorithms as demonstrated in 
the implementation of ring buffer communications for the parallel turboma- 
chinery application. If parallel algorithms are necessarily structurally more 
complex, the choice of F ortran as a base language for parallel algorithms must 
be called into question. Clearly, objec^oriented paradigms designed to imple- 
ment algorithms of high structural complexity must be considered more ap- 
propriate for parallel algorithm development in the future. 

Before the modified algorithm can be considered as an appropriate par- 
allel algorithm several additional issues must be investigated. The robustness 
of the modified algorithm for extremely complex geometries needs to be dem- 
onstrated. Investigation into the potential for violation of conservation law 
principles in the modified algorithm is suggested. Alternate iterative methods 
such as successive over relaxation (SOR), preconditioned iterative methods, 
and convergence accelerators such as multigrid methods should be compared 
to the modified algorithm in a cost effectiveness study. In addition, scalability 
of the modified algorithm for grids of finer resolution should be studied fur- 
ther. Adaptive methods where the either the grid or the domain decomposi- 
tion is adapted in order to reduce iterations required by the modified algo- 
rithm is also suggested. 



APPENDIX A 

PARALLEL EMULATOR PROGRAM 
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#include <stdio.h> 

# include <mat h . h> 

/* 

Emulation program for LU decomposition based CFD algorithm. 

Emulates execution on a zero message latency parallel machine with 
variable number of processors. This emulator only estimates 
relative execution times based on the parallelism profile of the 
algorithm. 

*/ 


int parallelism_jprofile [100000] ; 

/* 

compute_parallelism_prof ile fills the array parallelism_prof ile with 
the parallelism profile of the forward and backward substitution passes 
of the LU decomposition solver. The parallelism profile is determined 
from the fact that ail cells that lie on the plane {i+j+k = constant} 
can be executed concurrently. This function returns the number of 
steps contained in the parallelism profile. The arguments to this 
function are the size of the computational space in i, j , and k 
dimensions . 

*/ 

int compute_parallelism_prof ile (ni, n j, nk) 
int ni,nj,nk ; 

{ 

int i, j, k, prof ile__size ; 

profile_size - ni+nj+nk-2 ; 

bzero (parallelism_jprof ile, prof ile_size* 

sizeof (*parallelism__profile) ) ; 

for (i=0; i<ni; i++) 
for (j=0; j<n j ; j++) 
for (k=0; k<nk; k++) 

parallelism__prof ile [i + j+k] ++ ; 
return profile_size ; 

} 
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/* 

compute_time computes the execution time for a problem with dimensions 
ni,nj,nk when executed on N processors. It computes the total time 
including the time required to set up the LU decoirposition problem. 

The amount of time required to setup the LU decomposition is determined 
from the measurement of 60% of execution time spent on this operation 
during sequential execution. 20% is spent on the forward pass, and 
20% is spent on the backward pass making a total execution time of 1 
for a problem size of 1 cell. 

*/ 

double compute_t ime (ni, n j , nk, N) 
int ni,nj,nk,N ; 

{ 

double total__time = 0.0 ; 

int prof ile_size, i ; 

double SETUP JTIME_PER__CELL = 0.6 ; 

double SUBSTITUTION_TIME_PER CELL = 0.2 ; 


total_time += SETUP_TIME_PER_CELL*ceil ( ( (double) ( (ni) * (n j ) * (nk) ) ) / 

( (double) N) ) ; 

profile_size = computejparallelism_prof ile (ni, n j, nk) ; 

for (i=0; i<prof ile__size; i++) 

total__time += 2 *SUBSTITUTION_TIME_PER_CELL* 

ceil ( ( (double) parallelism_j>rof ile [i] ) I ( (double) N) ) ; 
return total_time ; 

} 
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/* 

We are now going to use our compute_time emulator to determine the 
sizeup metric for this parallel algorithm. This is accomplished by 
first computing the time to execute the program with 1 processor 
on a given problem size. Then we will compute the time required to 
solve a larger problem on N processors. If we solve the problem in 
less time we will increase the size of the problem and try again. 

When we have found the size of the problem that can execute in the 
same amount of time, we will increment N and repeat the process. 

*/ 

main ( ) 

{ 

int n [ 3 ] , 1=0, J=1,K=2 ; 

double sequent ial__time, parallel_time ; 

int N = 1, sequent ial_size, i ; 

/* The n array contains the size of the sequential reference process * / 
n[I] - 3 ; 
n [ J] - 3 ; 
n[K] = 3 ; 

/* sequential_size is used to normalize sizeup numbers */ 
sequential_size = n [I] *n [ J] *n [K] ; 

sequent ial_t ime = compu t e__t ime (n[I],n[J],n[K],N) ; 
printf ("sequent ial time = %f \n", sequent ial_t ime) ; 
for (N=2, i=0;N<=4000;N=N+N/25+l) { 

do { 

parallel_time = compute_time (n [I] , n [ J] , n [K] , N) ; 
if (parallel_time < sequential_time) { 
n[i]++ ; 
i = (i+l)%3 ; 

} 

} while (parallel_time < sequent ial_t ime) ; 
printf ("%d %f\n",N, 

( (float) n [I] *n [ J] ^n [K] ) / (float) sequential_size) ; 

} 

exit(0) ; 

} 



APPENDIX B 


EXAMPLE INPUT FILE FOR THE DOMAIN DECOMPOSITION TOOL 



simulation__name = "Counter-Rotating Prop Fan" ; 

// temporal accuracy can be: 

// first-order 

// three-point-backward 

// trapezoidal-time-differencing (Not recommended) 

temporal_accuracy = "three-point-backward" ; 
spatial_accuracy = 3 ; 

// limiters: 

// minmod (use with second and third order) 

// superbee (use only with second order) 

// van-leer 

limiter = "van-leer" ; 

number_of_cycles = 480 ; 

beta = 0.0 ; 

// If 1 the flux jacobians will be computed each newton iteration 

// if not 1 flux jacobians will not be computed after 2nd newton 

// iteration, this saves time but uses extra memory 
f lux_jacobian_update_frequency = 2 ; 
residual_print_f requency = 1 ; 

number_of_ref inement_iterations = 7 ; 

initial_cycles_of_zero_pressure_gradient_BCs = 20 ; 
initial_cycles_of_f irst_order_calculations = 0 ; 
initial_cycles_of_entropy_violation_dissipation = 9999 ; 

free_stream_mach_n umber = 0.7 ; 

number__of_cycles_between_clicks = 1 ; 

grid_symmetry = 8 ; 


READJSRID ("grid_2-block" r 2 ) ; 



BL0CK(1, front_blad.es) { 
divide_block = 5 ; 

PATCH (xl,IJ_FACE, 1,42, 1,22) ; 

PATCH (x2,IJ_FACE, 42, 52, 1,11) ; 

PATCH (x3, IJ_FACE, 42, 52, 11, 22) ; 
PATCH (x4, 1 J_FACE, 52, 57, 1, 22) ; 

PATCH (p2,JK_FACE, 1,22,1,11) ; 

PATCH (p3,KI_FACE, 1, 11, 1, 8) ; 

PATCH (p4,KI_FACE, 1,11,8,57) ; 

TERMINAL (hub) {p4-} ; 

TERMINAL (blades) (x2-,x2+) ; 

TERMINAL (sting) {p3-} ; 

TERMINAL (exterior) {p3+,p4+} ; 
TERMINAL (front) {p2-} ; 

TERMINAL (rear) (p2+) ; 

TERMINAL (kplus) { xl+, x3+, x4+ } ; 
TERMINAL (kminus) {xl-, x3-, x4-} ; 

TERMINAL_ATTRIBUTES (hub) { 

boundary_type = "impermeable" ; 

} ; 

TERMINAL_ATTRIBUTES (blades) { 

boundary_type = "impermeable" ; 

} ; 

TERMINAL_ATTRIBUTES (sting) { 
zero_area ; 

} ; 

TERMINAL_ATTRIBUTES (exterior) { 

boundary_type = "inflow-outflow" 

) ; 

TERMINAL_ATTRIBUTES ( front ) { 
boundary_type = "inflow" ; 

} ; 
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BLOCK (2, rear_blades) { 
divide_block = 5 ; 

PATCH (pi, IJ_FACE, 1, 10, 1,22) ; 

PATCH (p2, 1 J_FACE, 10,20, 1, 11) ; 

PATCH (p3, IJ_FACE, 10, 20, 11, 22) ; 

PATCH (p4,IJ_FACE, 20, 57, 1,22) ; 

PATCH (p5,JK_FACE, 1,22,1,11) ; 

PATCH (p6,KI_FACE, 1,11,1,57) ; 

TERMINAL (hub) {p6-} ; 

TERMINAL (exterior) {p6+} ; 

TERMINAL (front) {p5-} ; 

TERMINAL (rear) {p5+} ; 

TERMINAL (kplus) {pl+,p3+,p4+} ; 
TERMINAL (kminus) {pl-,p3-,p4-} ; 
TERMINAL (blades) {p2+,p2-} ,- 
TERMINAL_ATTRIBUTES (hub) { 

boundary_type = "inpermeable" ; 

} ; 

TERMINAL_ATTRIBUTES (blades) { 

boundary_type = "impermeable" ; 

1 ; 

TERMINAL_ATTRIBUTES (exterior) { 

boundary_type = "inflow-outflow" ; 

} ; 

TERMINAL_ATTRIBUTES (rear) { 
boundary_type = "outflow" ; 

} ; 



ZONE (front) { 

dtdt = -1.620 ; 

GRAB_BLOCK ( f ront_blades ) ; 

CONNECT (front Jblades . kplus, front_blades .kminus) : SYMMETRIC ; 
RINGJBUFFER (ring_buf f er, JK_FACE, KDIM, 05) { f ront_blades . rear } 

} ; 

ZONE (rear) { 

dtdt = 1.620 ; 

GRAB_BLOCK (rear_blades ) ; 

CONNECT (rear_blades .kplus, rear_blades. kminus) :SYMMETRIC ; 
RING_BUFFER(ring_buf fer, JK_FACE , KDIM, 09) {rear_blades . front } 

} ; 

CONNECT (front . ring_buf fer, rear . ring_buf fer) ; 
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